Massive Algorithms: H-Index [LeetCode] | Training dragons the hard way

H-Index [LeetCode] | Training dragons the hard way - Programming Every Day!

Given an array of citations (each citation is a non-negative integer) of a researcher, write a function to compute the researcher's h-index.

According to the definition of h-index on Wikipedia: "A scientist has index h if h of his/her N papers have at least h citations each, and the other N − h papers have no more than h citations each."

For example, given citations = [3, 0, 6, 1, 5], which means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively. Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, his h-index is 3.

Note: If there are several possible values for h, the maximum one is taken as the h-index.

As we know, H-Index is a measure of productivity and citation impact of a researcher. It is also called Hirsch index or Hirsch number after the name of the physicist Jorge Hirsch.

Picture 1: H-Index for decreasing citations array (Credit: Wiki)

Supposed that citations is an array of numbers of citations of a researcher. We can easily see that the following formula holds if citations is sorted in decreasing order:

h-index of citations = max (min (citations[i], i) for all i from 1 to number of papers)

X. O(N)
We have some simple observation here, but it really helps to improve the performance. For each number h from 0 to n where n is the number of papers, we need to find out how many citations of a paper equal h called equal_h. Based on this, we can find the number of citations values that is at least h, and no more than h. To find the number of citations value that is at least h, we take sum of equal_h[i] for i from h to n!!! And similarly, to find the number of citations values that is no more than h citations each, we can sum up equal_h[i] for ifrom 0 to i. And we can find the h-index by simple running from the back of the array equal_h.
So if running from the back of the array equal_h , and h is the first index that satisfies the following conditions:
equal_h[h] + equal_h[h+1] + ... + equal_h[n] >= h (*)
equal_h[0] + equal_h[1] + ... + equal_h[h] >= N-h (**)
Then h is what we are looking for.
However, we have:
equal_h[0] + equal_h[1] + ... + equal_h[n] = n
Therefore:
equal_h[0] + equal_h[1] + ... + equal_h[h] = n- (equal_h[h+1] + ... + equal_h[n])
Another note is that since h is the first element satisfies the 2 above conditions, then h+1 does not satisfies one of them, meaning that either
(1): { equal_h[h+1] + equal_h[h+2] + ... + equal_h[n] <= h
or (2): { equal_h[h+1] + equal_h[h+2] + ... + equal_h[n] >= h+1
and equal_h[0] + equal_h[1] + ... + equal_h[h+1] < N-(h+1) }
(1) suggests that:
equal_h[0] + equal_h[1] + ... + equal_h[h] >=N-h which is (**)
(2) suggests that:
equal_h[h+2] + equal_h[h+3] + ... + equal_h[n] >= h+2
This inequality will be repeated until equal_h[n] >= n , which is wrong.

So all we need is to find the first h satisfies the condition (*), and we do not need to check the condition (**).

public int hIndex(int[] citations) {
int n = citations.length;
int [] equal_h = new int[n+1];
for (int h = 0; h<n; h++){
if(citations[h] >= n) equal_h[n] += 1;
else equal_h[citations[h]] += 1;
}
int s = 0; //we don't need check overflow here coz sum always <= n
for (int h = n; h>0; h--){
s += equal_h[h];
if (s >= h) return h;
}
return 0;
}

http://segmentfault.com/a/1190000003794831

时间 O(NlogN) 空间 O(1)

先将数组排序，我们就可以知道对于某个引用数，有多少文献的引用数大于这个数。对于引用数citations[i]，大于该引用数文献的数量是citations.length - i，而当前的H-Index则是Math.min(citations[i], citations.length - i)，我们将这个当前的H指数和全局最大的H指数来比较，得到最大H指数。

    public int hIndex(int[] citations) {
        // 排序
        Arrays.sort(citations);
        int h = 0;
        for(int i = 0; i < citations.length; i++){
            // 得到当前的H指数
            int currH = Math.min(citations[i], citations.length - i);
            if(currH > h){
                h = currH;
            }
        }
        return h;
    }

http://www.neozone.me/leetcode274.html

时间 O(N) 空间 O(N)

也可以不对数组排序，我们额外使用一个大小为N+1的数组stats。stats[i]表示有多少文章被引用了i次，这里如果一篇文章引用大于N次，我们就将其当为N次，因为H指数不会超过文章的总数。为了构建这个数组，我们需要先将整个文献引用数组遍历一遍，对相应的格子加一。统计完后，我们从N向1开始遍历这个统计数组。如果遍历到某一个引用次数时，大于或等于该引用次数的文章数量，大于引用次数本身时，我们可以认为这是H指数。之所以不用再向下找，因为我们要取最大的H指数。那如何求大于或等于某个引用次数的文章数量呢？我们可以用一个变量，从高引用次的文章数累加下来。因为我们知道，如果有x篇文章的引用大于等于3次，那引用大于等于2次的文章数量一定是x加上引用次数等于2次的文章数量。

    public int hIndex(int[] citations) {
        int[] stats = new int[citations.length + 1];
        int n = citations.length;
        // 统计各个引用次数对应多少篇文章
        for(int i = 0; i < n; i++){
            stats[citations[i] <= n ? citations[i] : n] += 1;
        }
        int sum = 0;
        // 找出最大的H指数
        for(int i = n; i > 0; i--){
            // 引用大于等于i次的文章数量，等于引用大于等于i+1次的文章数量，加上引用等于i次的文章数量 
            sum += stats[i];
            // 如果引用大于等于i次的文章数量，大于引用次数i，说明是H指数
            if(sum >= i){
                return i;
            }
        }
        return 0;
    }

http://blog.welkinlan.com/2015/11/05/h-index-i-ii-leetcode-java/

    public int hIndex(int[] citations) {

        int n = citations.length;

        int[] count = new int[n + 1];

        for (int i = 0; i < n; i++) {

            if (citations[i] > n) count[n]++;

            else count[citations[i]]++;

}

        for (int i = n; i > 0; i--) {

            if (count[i] >= i) return i;

            count[i-1] += count[i];

}

        return 0;

}

https://github.com/kamyu104/LeetCode/blob/master/C++/h-index.cpp

	int hIndex(vector<int>& citations) {
	const auto n = citations.size();
	vector<int> count(n + 1, 0);
	for (const auto& x : citations) {
	// Put all x >= n in the same bucket.
	if (x >= n) {
	++count[n];
	} else {
	++count[x];
	}
	}

	int h = 0;
	for (int i = n; i >= 0; --i) {
	h += count[i];
	if (h >= i) {
	return i;
	}
	}
	return h;
	}

http://bookshadow.com/weblog/2015/09/03/leetcode-h-index/

使用长度为N + 1的数组cnts记录引用次数在0 ~ N（N篇以上的记为N）的文章个数

然后遍历cnts数组，计算h值的最大值

def hIndex(self, citations): N = len(citations) cnts = [0] * (N + 1) for c in citations: cnts[[c, N][c > N]] += 1 sums = 0 for h in range(N, 0, -1): if sums + cnts[h] >= h: return h sums += cnts[h] return 0
X. O(NlogN)
http://segmentfault.com/a/1190000003794831

    public int hIndex(int[] citations) {
        // 排序
        Arrays.sort(citations);
        int h = 0;
        for(int i = 0; i < citations.length; i++){
            // 得到当前的H指数
            int currH = Math.min(citations[i], citations.length - i);
            if(currH > h){
                h = currH;
            }
        }
        return h;
    }

http://www.neozone.me/leetcode274.html
现将数组排序。然后从大到小遍历，一边计数一边比较计数与当前的引用数，直到计数大于引用数。

    public int hIndex(int[] citations) {

        if(citations == null || citations.length == 0) return 0;

        Arrays.sort(citations);

        int count = 0;

        for(int i = citations.length - 1; i >= 0; i--){

            if(++count > citations[i]) {

                count--;

                break;

}

}

        return count;

}

http://www.cnblogs.com/grandyang/p/4781203.html
Therefore, we can easily implement the calculation of h-index by sorting the array in decreasing order then applying the above formula.

def hIndex(self, citations):
"""
:type citations: List[int]
:rtype: int
"""
citations.sort(reverse=True)
return max([min(k+1, v) for k,v in enumerate(citations)]) if citations else 0

https://leetcode.com/discuss/55924/o-nlogn-12ms-solution
int hIndex(vector<int>& citations) { if(citations.empty()) return 0; sort(citations.begin(), citations.end()); int n=citations.size(); int i=0; while(i<n && citations[i]<(n-i)) i++; return n-i; }

这道题让我们求H指数，这个质数是用来衡量研究人员的学术水平的质数，定义为一个人的学术文章有n篇分别被引用了n次，那么H指数就是n。而且wiki上直接给出了算法，可以按照如下方法确定某人的H指数：1、将其发表的所有SCI论文按被引次数从高到低排序；2、从前往后查找排序后的列表，直到某篇论文的序号大于该论文被引次数。所得序号减一即为H指数。

    int hIndex(vector<int>& citations) {
        sort(citations.begin(), citations.end(), greater<int>());
        for (int i = 0; i < citations.size(); ++i) {
            if (i >= citations[i]) return i;
        }
        return citations.size();
    }

https://github.com/kamyu104/LeetCode/blob/master/C++/h-index.cpp
Not good idea to use iterator or for loop here, as we need use the index.


	int hIndex(vector<int>& citations) { sort(citations.begin(), citations.end(), greater<int>());
	int h = 0;
	for (const auto& x : citations) {
	if (x >= h + 1) {
	++h;
	} else {
	break;
	}
	}
	return h;
	}

(2)：按照文章引用次数正序排列，循环遍历数组，统计min(引用次数c, 剩余文章篇数N-i)的最大值
def hIndex(self, citations): N = len(citations) for i, c in enumerate(sorted(citations)): if N - i <= c: return N - i return 0
Techinpad: [LeetCode] H-Index

https://leetcode.com/discuss/56034/four-lines-in-java-8-using-functional-programming

public int hIndex(int[] c) { if (c == null) { return 0; } List<Integer> rsorted = Arrays.stream(c).mapToObj(i->Integer.valueOf(i)).
sorted(Collections.reverseOrder()).collect(Collectors.toList()); return IntStream.range(0,rsorted.size()).
reduce(0,(hindex,curindex)->Math.max(hindex,Math.min(curindex+1,rsorted.get(curindex)))); }
http://www.fgdsb.com/2015/01/07/find-h/

一个array里面找最大的这样的h: 至少有h个数大于等于h。
比如{2,3,5}答案是2，{5,6,7,8}答案是4。

可以用quick select的思路来做，平均时间复杂度O(n),空间O(1)

可以用quick select的思路来做，平均时间复杂度O(n),空间O(1)


int find_h_s( std::vector<int> arr ){

    int ret = 0;

    for(int low = 0, high = (int)arr.size(); low < high;){

        int n = low + (high - low) / 2;

        nth_element(arr.begin()+low, arr.begin()+n, arr.begin()+high, greater<int>());

       

        if(arr[n] >= n+1){

            ret = max(ret, n+1);

            low = n+1;

        } else {

            high = n;

        }

    }

    return ret;

}

如果允许extra memory的话可以用counting的思路做，更简单，时间和空间都是O(n)


int find_h_c(vector<int> a) {

    int n = (int)a.size();

    vector<int> m(n+1, 0);

    for(auto num : a) {

        if(num >= n) ++m[n];

        else ++m[num];

    }

    for(int i = n, sum = 0; i > 0; --i) {

        sum += m[i];

        if(sum >= i) return i;

    }

    return 0;

}

还可以先排序，然后直接扫一遍，时间复杂度就是排序的复杂度。


int find_h_sort(vector<int> a) {

    sort(a.begin(), a.end(), greater<int>());

    

    for(int i = (int)a.size(); i > 0; --i) {

        if(a[i-1] >= i) return i;

    }

    return 0;

}

Read full article from H-Index [LeetCode] | Training dragons the hard way - Programming Every Day!

H-Index [LeetCode] | Training dragons the hard way - Programming Every Day!

Labels

Popular Posts